Creators/Authors contains: "Wei, Alexander"

  1. Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. We use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks. We find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our analysis emphasizes the need for safety-capability parity -- that safety mechanisms should be as sophisticated as the underlying model -- and argues against the idea that scaling alone can resolve these safety failure modes. 
  2. Colloidal Ag particles decorated with Fe₃O₄ islands can be electrochemically or photochemically activated as inverse catalysts for C(sp²)–H heteroarylation. The silver–iron oxide (SIO) particles are reduced into redox-active forms by cathodic charging at mild potentials or by short-term light exposure, and can be reused multiple times by magnetic cycling without further activation. A negative shift in the reduction peak is attributed to an overpotential produced by surface Fe₃O₄, which separates residual Ag ions or clusters from bulk silver. The catalytic efficiency of SIO is maintained even with acid degradation, which can be countered simply by adding water to the reaction medium.
  3. Roll-to-roll printing has significantly shortened the time from design to production of sensors and IoT devices, while being cost-effective for mass production. However, because fewer manufacturing tolerance controls are available, properties such as sensor thickness, composition, and roughness cannot be precisely controlled. Since these properties likely affect sensor behavior, roll-to-roll printed sensors require validation testing before they can be deployed in the field. In this work, we improve the testing of nitrate sensors, which need to be calibrated in a solution of known nitrate concentration for around 1–2 days. To accelerate this process, we observe the initial behavior of the sensors for a few hours and use a physics-informed machine learning method to predict their measurements 24 hours in the future, thus saving valuable time and testing resources. Due to the variability in roll-to-roll printing, this prediction task requires models that are robust to changes in the properties of new test sensors. We show that existing methods fail at this task and describe a physics-informed machine learning method that improves the prediction robustness to different testing conditions (≈ 1.7× lower in real-world data and ≈ 5× lower in synthetic data when compared with the current state-of-the-art physics-informed machine learning method).